[plan] Extract ResourceCache and PythonResourceBridge from AgentPlan #548

Open

weiqingy wants to merge 3 commits into apache:main from weiqingy:issue_547

Conversation

@weiqingy
Collaborator

Linked issue: #547

Purpose of change

AgentPlan (624 lines) mixes plan definition, resource caching/resolution, Python bridge wiring, and serialization.
This PR extracts two classes to separate concerns:

  • ResourceCache — lazy resource resolution, caching, and cleanup. Created by the operator in open(), owned by
    the operator lifecycle.
  • PythonResourceBridge — static discoverPythonMCPResources() for Python MCP tool/prompt discovery. Called
    during operator init.

After extraction, AgentPlan becomes immutable after construction (~490 lines, down from 624). The removed public
methods are getResource(), close(), and setPythonResourceAdapter().
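
The extracted cache can be pictured roughly like this — a minimal sketch only, with stand-in definitions for ResourceType, ResourceProvider, and the resource values (the real types live in the plan module; only the method shapes named in this PR are mirrored):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-ins; the real ResourceType/ResourceProvider are plan-module types.
enum ResourceType { CONNECTION, PROMPT, TOOL }

interface ResourceProvider {
    Object provide(String name);
}

// Minimal sketch of ResourceCache: lazy resolution, caching, and cleanup.
class ResourceCache implements AutoCloseable {

    private final Map<ResourceType, Map<String, ResourceProvider>> resourceProviders;
    private final Map<ResourceType, Map<String, Object>> cache = new HashMap<>();

    ResourceCache(Map<ResourceType, Map<String, ResourceProvider>> resourceProviders) {
        this.resourceProviders = resourceProviders;
    }

    // Lazy: the provider runs on first access; later lookups hit the cache.
    Object getResource(String name, ResourceType type) {
        return cache
                .computeIfAbsent(type, t -> new HashMap<>())
                .computeIfAbsent(name, n -> resourceProviders.get(type).get(n).provide(n));
    }

    // Cleanup hook, invoked when the owning operator closes.
    @Override
    public void close() {
        cache.clear();
    }
}
```

Because the cache is created in the operator's open() and closed with it, its lifetime never outlives a single operator subtask.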

Tests

  • mvn test -pl plan — all plan module tests pass
  • mvn test -pl runtime — all runtime module tests pass
  • ./tools/lint.sh -c — formatting check passed
  • ./tools/ut.sh -j — full Java test suite passed

API

Yes. Three public methods removed from AgentPlan:

  • getResource(String, ResourceType) — replaced by ResourceCache.getResource()
  • close() — replaced by ResourceCache.close()
  • setPythonResourceAdapter(PythonResourceAdapter) — replaced by PythonResourceBridge.discoverPythonMCPResources()
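
A hypothetical before/after for callers, showing the ownership move. OperatorSketch and the String-keyed cache are illustrative stand-ins, not the PR's real classes; the real cache is additionally keyed by ResourceType:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical operator showing the new ownership model: the cache is created
// in open() and released in close(), replacing the removed AgentPlan methods.
class OperatorSketch {

    private final Map<String, Function<String, Object>> providers;
    private Map<String, Object> cache;  // stand-in for ResourceCache

    OperatorSketch(Map<String, Function<String, Object>> providers) {
        this.providers = providers;
    }

    void open() {
        cache = new HashMap<>();  // was: resolution state wired inside AgentPlan
    }

    Object resolve(String name) {  // was: plan.getResource(name, type)
        return cache.computeIfAbsent(name, n -> providers.get(n).apply(n));
    }

    void close() {  // was: plan.close()
        cache.clear();
        cache = null;
    }
}
```

AgentPlan itself no longer carries any of this mutable state, which is what makes it immutable after construction.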

Documentation

  • doc-needed
  • doc-not-needed
  • doc-included

@github-actions github-actions bot added the priority/major (Default priority of the PR or issue.), fixVersion/0.3.0 (The feature or bug should be implemented/fixed in the 0.3.0 version.), and doc-not-needed (Your PR changes do not impact docs) labels on Feb 22, 2026
@weiqingy
Collaborator Author

Both CI failures are unrelated to our changes:

  1. it-python [flink-2.2] — Ollama download returned HTTP 404. This is a transient infrastructure issue — the Ollama release artifact was temporarily unavailable. Nothing to fix on our side.
  2. cross-language — Py4JError / Py4JNetworkError in the cross-language e2e tests. The Java gateway crashed during test_java_chat_model_integration and subsequent tests couldn't connect. Our PR only touches plan/ and runtime/ Java code — it doesn't affect the Python-Java bridge or cross-language e2e test infrastructure.

Collaborator

@wenjin272 wenjin272 left a comment


Thanks for your work, @weiqingy. I think the separation makes sense.

Besides, I think we should also apply this separation on the Python side. If possible, could you implement it in this PR?

public class ResourceCache implements AutoCloseable {

    private final Map<ResourceType, Map<String, ResourceProvider>> resourceProviders;
    private final Map<ResourceType, Map<String, Resource>> cache = new HashMap<>();
Collaborator


Should we use ConcurrentHashMap here? Tasks submitted by ctx.durableExecuteAsync may read/write this HashMap in parallel.

Collaborator Author


Good question! This is intentional. After the refactoring, ResourceCache is created and owned by ActionExecutionOperator.open(), so it's scoped to a single operator subtask. In Flink's execution model, all access (processElement, open, close) runs on the same mailbox thread. durableExecuteAsync dispatches work to external threads, but resource resolution from the cache happens on the operator thread before dispatch. So HashMap is sufficient and avoids the overhead of ConcurrentHashMap. Let me know if this matches your understanding.

Collaborator

@wenjin272 wenjin272 Mar 2, 2026


but resource resolution from the cache happens on the operator thread before dispatch

I think the resource resolution may not always happen before dispatch.

Take the built-in chat action as an example. In the chat action, we submit the chat task to external threads, and that task calls the chat method of BaseChatModelSetup, which runs on an asynchronous thread. Inside chat, the chat model setup resolves the corresponding connection, prompt, and tools, which I think may lead to concurrent access to the cache.

Collaborator Author


You're right, thanks for pushing back on this. I traced the code path more carefully:

ChatModelAction → durableExecuteAsync(callable) → async pool thread runs chatModel.chat() → BaseChatModelSetup.chat() calls this.getResource.apply() to resolve connection, prompt, and tools → which hits ResourceCache.getResource().

So resource resolution does happen on async threads, not just the mailbox thread. I'll switch back to ConcurrentHashMap. Good catch!
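
The switch discussed here can be sketched as follows (names are illustrative, not the PR's code; the point is that ConcurrentHashMap.computeIfAbsent is atomic per key, so a resolver runs at most once per resource even when async pool threads race on it):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the fix: a ConcurrentHashMap-backed cache that threads dispatched
// via durableExecuteAsync can read and write safely.
class ConcurrentResourceCache {

    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // Atomic per key: concurrent callers for the same name see one resolution.
    Object getResource(String name, Function<String, Object> resolver) {
        return cache.computeIfAbsent(name, resolver);
    }
}
```

One caveat with ConcurrentHashMap.computeIfAbsent: the mapping function must not attempt to update the same map (recursive updates can throw or block), so a resolver that in turn resolves other resources should do so outside the compute call.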

Collaborator Author


Updated the PR.

@weiqingy
Collaborator Author

weiqingy commented Mar 2, 2026

Thanks for the review, @wenjin272! I've updated the PR to apply the same separation on the Python side as well. Could you please take another look?

Collaborator

@wenjin272 wenjin272 left a comment


LGTM. Could you take a look at your convenience, @xintongsong?

@weiqingy
Collaborator Author

weiqingy commented Mar 2, 2026

I checked the CI failures: both are LLM-dependent e2e tests and don't appear to be caused by this PR.

Test 1 (react_agent_test): The output 4444 = 2123 + 2321 proves our ResourceCache IS working correctly — the chat model was resolved, the add tool was resolved and called successfully. The LLM (qwen3:1.7b) simply stopped after one tool call instead of continuing to call multiply(4444, 312). This is LLM non-determinism.

Test 2 (long_term_memory_test): This runs on the Flink remote runner, where there's exactly ONE FlinkRunnerContext with ONE ResourceCache. The behavior is identical to before. The failure is assert len(doc) == 1 after LLM-based compaction using qwen3:8b — if the model's summarization response is malformed, compaction produces incorrect output.

We can re-run CI to confirm flakiness — if it fails again with different assertion values, that would further support LLM non-determinism.

@wenjin272 do you have access to re-run the CI tests? It looks like admin rights are required.

@wenjin272
Collaborator

Hi @weiqingy, sorry, I don't have access to re-run the CI. I acknowledge that the failing test is flaky on its own and not caused by this PR.
